scGate to annotate integrated scRNA-seq datasets

A typical task in single-cell analysis is cell type annotation of datasets composed of multiple samples. You may have used one of several batch-effect correction tools to integrate samples from different sources and technologies into a combined map. In this demo we show how scGate can help you annotate this integrated map using simple, customizable models based on standard gene markers from the literature. We use a PBMC dataset integrated with either STACAS or Harmony, but the same approach applies to other integration tools.

Set up the environment

library(renv)
renv::activate()
renv::restore()

library(ggplot2)
library(dplyr)
library(patchwork)
library(Seurat)
library(harmony)

# Packages from GitHub (requires the 'remotes' package)
remotes::install_github('satijalab/seurat-data')
remotes::install_github("carmonalab/scGate")
remotes::install_github("carmonalab/STACAS")

library(scGate)
library(SeuratData)
library(STACAS)

Get a test dataset

Download the dataset of PBMCs (SCP424) distributed with SeuratData. For more information on this dataset, run ?pbmcsca.

options(timeout = max(300, getOption("timeout")))
InstallData("pbmcsca")
data("pbmcsca")

scGate on STACAS-integrated object

data("pbmcsca")
pbmcsca <- NormalizeData(pbmcsca)

pbmc.list <- SplitObject(pbmcsca, split.by = "Method")
pbmc.stacas <- Run.STACAS(pbmc.list, anchor.features = 2000)
pbmc.stacas <- ScaleData(pbmc.stacas) %>%
    RunPCA() %>%
    RunUMAP(dims = 1:30)
DimPlot(pbmc.stacas, group.by = "Method") + theme(aspect.ratio = 1)

We can run scGate directly on this integrated space, for instance to isolate NK cells:

models.db <- scGate::get_scGateDB()
model.NK <- models.db$human$generic$NK

pbmc.stacas <- scGate(pbmc.stacas, model = model.NK, reduction = "pca", ncores = 4,
    output.col.name = "NK")

We can compare the automatic filtering to the “CellType” manual annotation by the authors:

DimPlot(pbmc.stacas, group.by = c("NK", "CellType"), ncol = 2) + theme(aspect.ratio = 1)
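
Beyond the visual comparison, the agreement between the gate and the authors' annotation can be quantified. The following is a minimal sketch on a toy metadata table; in the real workflow `meta` would be `pbmc.stacas@meta.data`. It assumes that the gate column holds the values "Pure" and "Impure" and that the reference label for NK cells is "Natural killer cell"; check both against your own object before reusing it.

```r
# Toy stand-in for pbmc.stacas@meta.data (values are illustrative, not real results)
meta <- data.frame(
    NK = c("Pure", "Pure", "Impure", "Impure", "Pure", "Impure"),
    CellType = c("Natural killer cell", "Natural killer cell", "Natural killer cell",
        "B cell", "Cytotoxic T cell", "B cell")
)

# Assumed labels: "Pure" for gated cells, "Natural killer cell" in the reference
is_gated <- meta$NK == "Pure"
is_ref <- meta$CellType == "Natural killer cell"

# Confusion table of scGate calls against the reference annotation
conf_table <- table(scGate = is_gated, reference = is_ref)
print(conf_table)

# Fraction of gated cells that are reference NK cells, and
# fraction of reference NK cells recovered by the gate
precision <- sum(is_gated & is_ref) / sum(is_gated)
recall <- sum(is_gated & is_ref) / sum(is_ref)
cat("precision:", round(precision, 2), " recall:", round(recall, 2), "\n")
```

Keep in mind that the manual annotation is itself imperfect, so these numbers measure agreement rather than accuracy.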

New models can be easily defined based on cell type-specific markers from the literature. For instance, we can set up a simple new model to identify megakaryocytes:

model.MK <- scGate::gating_model(name = "Megakaryocyte", signature = c("ITGA2B",
    "PF4", "PPBP"))

pbmc.stacas <- scGate(pbmc.stacas, model = model.MK, reduction = "pca", ncores = 4,
    output.col.name = "Megakaryocyte")
DimPlot(pbmc.stacas, group.by = c("Megakaryocyte", "CellType"), ncol = 2) + theme(aspect.ratio = 1)
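
A model can also combine several signatures. As a hedged sketch, the megakaryocyte model could be extended with a negative signature, so that cells scoring high for it are excluded from the gate. The name `model.MK2` and the choice of T cell markers as the negative signature are illustrative assumptions, not a recommendation from the scGate authors. The code is guarded so that it only runs when scGate is installed.

```r
if (requireNamespace("scGate", quietly = TRUE)) {
    # Positive signature at level 1: same markers as the simple model
    model.MK2 <- scGate::gating_model(name = "Megakaryocyte", signature = c("ITGA2B",
        "PF4", "PPBP"), level = 1)

    # Negative signature at the same level (illustrative choice of markers):
    # cells with a high score for it do not pass the gate
    model.MK2 <- scGate::gating_model(model = model.MK2, name = "Tcell", signature = c("CD3D",
        "CD3E"), level = 1, negative = TRUE)

    # The model is a small table with one row per signature
    print(model.MK2)
} else {
    message("scGate is not installed; skipping the model definition")
}
```

The resulting object can be passed to the `model` argument of scGate() in the same way as the single-signature model.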

We can also run multiple gating models at once. Besides pure/impure classifications for each model, scGate will also return a combined annotation based on all the models we provided. In this setting, scGate can be used as a multi-classifier to automatically annotate datasets:

models.db <- scGate::get_scGateDB()

models.hs <- models.db$human$generic
models.list <- models.hs[c("Bcell", "CD4T", "CD8T", "MoMacDC", "Plasma_cell", "NK",
    "Erythrocyte", "Megakaryocyte")]

pbmc.stacas <- scGate(pbmc.stacas, model = models.list, reduction = "pca", ncores = 4)
DimPlot(pbmc.stacas, group.by = c("Method", "CellType", "scGate_multi"), ncol = 3) +
    theme(aspect.ratio = 1)
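
A quick check on the multi-classifier output is to compare the composition of predicted cell types across batches: large differences between methods may point to residual batch effects or to gates that behave differently across technologies. The sketch below uses a toy metadata table in place of `pbmc.stacas@meta.data`. The method names and labels are illustrative, and the "Multi" and NA values for cells passing several gates or none are an assumption about the output format.

```r
# Toy stand-in for pbmc.stacas@meta.data (values are illustrative)
meta <- data.frame(
    Method = rep(c("10x Chromium (v2)", "Smart-seq2"), each = 5),
    scGate_multi = c("Bcell", "CD4T", "CD4T", "NK", NA,
        "Bcell", "CD4T", "Multi", "NK", "NK"),
    stringsAsFactors = FALSE
)

# Count cells per method and predicted label, keeping unclassified (NA) cells
counts <- table(meta$Method, meta$scGate_multi, useNA = "ifany")

# Convert to per-method proportions (each row sums to 1)
composition <- prop.table(counts, margin = 1)
print(round(composition, 2))
```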

UCell scores for individual signatures are also available in the metadata (columns ending in _UCell):

FeaturePlot(pbmc.stacas, ncol = 3, features = c("Tcell_UCell", "CD4T_UCell", "CD8T_UCell",
    "MoMacDC_UCell", "pDC_UCell", "Bcell_UCell"))

scGate on Harmony-integrated object

A very popular tool for single-cell data integration is Harmony. The RunHarmony() function provides a convenient wrapper to integrate samples stored in a single Seurat object:

pbmcsca <- NormalizeData(pbmcsca) %>%
    FindVariableFeatures() %>%
    ScaleData() %>%
    RunPCA(npcs = 30)
pbmc.harmony <- RunHarmony(pbmcsca, group.by.vars = "Method")

The corrected embeddings after batch effect correction will be stored in the ‘harmony’ reduction:

pbmc.harmony <- RunUMAP(pbmc.harmony, reduction = "harmony", dims = 1:30)

Let’s apply scGate in this space to isolate high-quality T cells:

models.db <- scGate::get_scGateDB()
model.Tcell <- models.db$human$generic$Tcell

pbmc.harmony <- scGate(pbmc.harmony, model = model.Tcell, reduction = "harmony",
    ncores = 4, output.col.name = "Tcell")
DimPlot(pbmc.harmony, group.by = c("Tcell", "CellType"), ncol = 2) + theme(aspect.ratio = 1)

We can also run multiple gating models at once. Besides pure/impure classifications for each model, scGate will also return a consensus annotation based on all the models we provided. In this setting, scGate can be used as a multi-classifier to automatically annotate datasets:

models.db <- scGate::get_scGateDB()

models.hs <- models.db$human$generic
models.list <- models.hs[c("Bcell", "CD4T", "CD8T", "MoMacDC", "Plasma_cell", "NK",
    "Erythrocyte", "Megakaryocyte")]

pbmc.harmony <- scGate(pbmc.harmony, model = models.list, reduction = "harmony",
    ncores = 4)
DimPlot(pbmc.harmony, group.by = c("Method", "CellType", "scGate_multi"), ncol = 3) +
    theme(aspect.ratio = 1)

Final notes

scGate can be applied as a ‘quality check’ on individual samples, to purify a cell population of interest and remove contaminants, prior to more advanced steps in single-cell data analysis (e.g. integration, clustering, differential gene expression, etc.). However, it is becoming increasingly common for analysts to begin working on pre-integrated collections of datasets, for instance when assembled and published by other research groups. As we have shown here, scGate can be applied directly on integrated objects, and their low-dimensional representations, to aid the annotation of cell types based on known gene markers.

By default, scGate calculates PCA embeddings from normalized feature counts, and repeats this operation for each hierarchical level of a gating model. Signature scores for each cell (calculated using UCell) are smoothed by the scores of the neighboring cells, and used to determine whether a given cell “passes the gate”. While such neighbor smoothing is generally more accurate when recalculated at each level of gating, it can be costly in terms of computing time. Providing a precalculated “reduction” to scGate, as shown in this demo, can significantly speed up computation and take advantage of dimensionality reductions in integrated space. The user must be aware, however, that gating results in the original or integrated space will, in general, differ; if batch effect correction introduces distortions in the integrated space, these will be reflected in the nearest neighbors of cells across samples and, consequently, in the signature scores used for gating.
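
To make the idea of neighbor smoothing concrete, here is a simplified base-R sketch. It is not scGate's implementation: it uses a plain unweighted mean over the k nearest neighbors in a toy two-dimensional embedding, and the function name `smooth_scores`, the value of k, and the threshold of 0.2 are assumptions chosen for illustration. It shows how a cell with a dropout-like zero score can still pass the gate when its neighbors score high, and why the choice of embedding matters: the neighbors are defined entirely by distances in that space.

```r
set.seed(1)

# Toy embedding: two well-separated groups of 10 cells each (rows = cells)
emb <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
    matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))

# Raw signature scores: high in the first group, low in the second
raw <- c(rep(0.6, 10), rep(0.05, 10))
raw[3] <- 0  # a cell of the first group that missed the signature

# Average each cell's score with those of its k nearest neighbors
smooth_scores <- function(emb, scores, k = 5) {
    d <- as.matrix(dist(emb))  # pairwise Euclidean distances
    vapply(seq_len(nrow(emb)), function(i) {
        nn <- order(d[i, ])[seq_len(k + 1)]  # the cell itself plus k neighbors
        mean(scores[nn])
    }, numeric(1))
}

smoothed <- smooth_scores(emb, raw, k = 5)

# Apply a positive threshold to the smoothed scores (assumed value)
pos.thr <- 0.2
gate <- ifelse(smoothed > pos.thr, "Pure", "Impure")
print(table(gate, group = rep(c("A", "B"), each = 10)))
```

If integration were to place cells of different types next to each other, the same averaging would pull their scores together, which is the mechanism behind the caveat above.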

Further reading

The scGate package and installation instructions are available at: scGate package

The code for this demo can be found on GitHub

The repository for scGate gating models is at: scGate models repository

References

Ding, Jiarui, et al. “Systematic comparison of single-cell and single-nucleus RNA-sequencing methods.” Nature Biotechnology 38.6 (2020): 737-746.

Andreatta, Massimo, and Santiago J. Carmona. “STACAS: Sub-Type Anchor Correction for Alignment in Seurat to integrate single-cell RNA-seq data.” Bioinformatics 37.6 (2021): 882-884.

Korsunsky, Ilya, et al. “Fast, sensitive and accurate integration of single-cell data with Harmony.” Nature Methods 16.12 (2019): 1289-1296.